Cloudflare CEO Matthew Prince has clarified that the company’s recent global outage was caused by an internal configuration error and a latent software flaw rather than any form of cyber attack.
A Major Disruption Across Large Parts Of The Internet
The outage at internet infrastructure company Cloudflare began at around 11:20 UTC on 18 November 2025 and lasted until shortly after 17:00, disrupting access to many of the world’s most visited platforms. Services including X, ChatGPT, Spotify, Shopify, Etsy, Bet365, Canva and multiple gaming platforms experienced periods of failure as Cloudflare’s edge network returned widespread 5xx errors. Cloudflare itself described the disruption as its most serious since 2019, with a significant portion of its global traffic unable to route correctly for several hours.
Symptoms
The symptoms were varied, ranging from slow-loading pages to outright downtime. For example, some users saw error pages stating that Cloudflare could not complete the request and needed the user to “unblock challenges.cloudflare.com”. For businesses that rely on Cloudflare’s CDN, security filtering and DDoS protection, even short periods of failure can stall revenue, block logins, and create customer support backlogs.
Given Cloudflare’s reach (serving a substantial share of global web traffic), the effect was not confined to one sector or region. In fact, millions of individuals and businesses were affected, even if they had no direct relationship with Cloudflare. That level of impact meant early scrutiny was intense and immediate.
Why Many Suspected A Major Cyber Attack
In the early stages, the pattern of failures resembled that of a large-scale DDoS campaign. Cloudflare had already been dealing with unusually high-volume attacks from the Aisuru botnet in the weeks before the incident, raising the possibility that this latest disruption might be another escalation. Internal teams initially feared that the sudden spike in errors and fluctuating recovery cycles could reflect a sophisticated threat actor pushing new attack techniques.
The confusion deepened when Cloudflare’s independent status page also went offline. Because the status page is hosted entirely outside Cloudflare’s own infrastructure, its simultaneous failure created the impression, inside and outside the company, that a skilled attacker might be targeting both Cloudflare’s infrastructure and the third-party service used for its status platform.
Commentary on social media, as well as early industry analysis, reflected that uncertainty. With so many services dropping offline at once, it seemed easy to assume the incident must have been caused by malicious activity or a previously unseen DDoS vector. Prince has acknowledged that even within Cloudflare, the team initially viewed the outage through that lens.
Prince’s Explanation Of What Actually Happened
Once the situation stabilised, Prince published an unusually detailed account explaining that the outage originated from Cloudflare’s bot management system and the internal processes that feed it. In his statement, he says the root of the problem lay in a configuration change to the permissions in a ClickHouse database cluster that generates a “feature file” used by Cloudflare’s machine learning model for evaluating bot behaviour.
What??
It seems that, according to Prince, the bot management system assigns a “bot score” to every inbound request and, to do that, relies on a regularly refreshed feature file listing the traits the model uses to classify traffic. This file is updated roughly every five minutes and pushed rapidly across Cloudflare’s entire network.
It seems that, during a planned update to database permissions, the query responsible for generating the feature file began returning duplicate rows from an additional schema, causing the file to grow significantly. For performance reasons, Cloudflare’s proxy software enforces a strict limit on how many features can be loaded. When the oversized file arrived, the system attempted to load it, exceeded the limit, and immediately panicked. That panic cascaded into Cloudflare’s core proxy layer, triggering 5xx errors across key services.
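To illustrate the failure mode Prince describes, here is a minimal Rust sketch. The names, the limit of 200 features and the file format are illustrative assumptions rather than Cloudflare’s actual code; the point is simply that a hard limit enforced with a panic turns an oversized configuration file into a crash of the request path, whereas a recoverable error would only degrade bot scoring.

// Illustrative sketch only: names, limits and file format are assumptions,
// not Cloudflare's real implementation.

const MAX_FEATURES: usize = 200; // hypothetical hard limit, kept low for performance

/// Parse a feature file (one feature name per line) into the list the
/// bot-scoring model would consume.
fn load_feature_file(contents: &str) -> Vec<String> {
    let features: Vec<String> = contents
        .lines()
        .map(|line| line.trim().to_string())
        .filter(|line| !line.is_empty())
        .collect();

    // A hard assertion like this is what turns "file too big" into a panic
    // that takes the whole proxy worker down with it.
    assert!(
        features.len() <= MAX_FEATURES,
        "feature file has {} entries, limit is {}",
        features.len(),
        MAX_FEATURES
    );

    features
}

fn main() {
    // A normal refresh: well under the limit, loads fine.
    let good_file = (0..60).map(|i| format!("feature_{i}\n")).collect::<String>();
    println!("loaded {} features", load_feature_file(&good_file).len());

    // The duplicated-rows scenario: the same features repeated several times
    // push the count past the limit, the assertion panics, and in the real
    // system that cascaded into 5xx errors.
    let oversized_file = good_file.repeat(4);
    println!("loaded {} features", load_feature_file(&oversized_file).len()); // panics here
}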
Stuck In A Cycle
Not all ClickHouse nodes received the permissions update at the same moment, so Cloudflare’s network then entered a cycle of partial recovery and renewed failure. Every five minutes, depending on which node generated the file, the network loaded either a valid configuration or a broken one. That pattern created the unusual “flapping” behaviour seen in error logs and made diagnosis harder.
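A rough simulation of why that produced flapping rather than one clean failure (again purely illustrative, with assumed node behaviour, feature counts and limit): each five-minute refresh could be built by a node that either had or had not yet received the permissions change, so the network alternated between a usable file and an oversized one.

// Illustrative simulation of the recovery/failure cycle; the node behaviour,
// feature counts and limit below are assumptions for the sake of the example.

const MAX_FEATURES: usize = 200;

// Whether the node that builds this refresh has already received the
// permissions change (and therefore emits duplicated rows).
fn build_feature_count(node_updated: bool) -> usize {
    if node_updated { 240 } else { 60 } // duplicated rows push the count past the limit
}

fn main() {
    // Six simulated five-minute refresh cycles, each served by a different node.
    for cycle in 0..6 {
        let node_updated = cycle % 2 == 1; // which node generates the file varies
        let count = build_feature_count(node_updated);
        if count > MAX_FEATURES {
            println!("refresh {cycle}: {count} features -> proxy panics, 5xx errors return");
        } else {
            println!("refresh {cycle}: {count} features -> file loads, traffic recovers");
        }
    }
}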
However, once engineers identified the malformed feature file as the cause, they stopped the automated distribution process, injected a known-good file, and began restarting affected services. Traffic began returning to normal around 14:30 UTC, with full stability achieved by 17:06.
Why The Framing Matters To Cloudflare
Prince’s post was clear and emphatic on one point: this event did not involve a cyber attack of any kind. The language used in the post, such as the phrase “not caused, directly or indirectly, by a cyber attack”, signalled an intent to remove any ambiguity.
There may be several reasons for this emphasis. For example, Cloudflare operates as a core piece of internet security infrastructure. Any suggestion that the company suffered a breach could have wide-ranging consequences for customer confidence, regulatory compliance, and Cloudflare’s standing as a provider trusted to mitigate threats rather than succumb to them.
Also, transparency is a competitive factor in the infrastructure market. By releasing a highly granular breakdown early, Cloudflare is signalling to customers and regulators that the incident, though serious, stemmed from internal engineering assumptions rather than a persistent security failure, and can be addressed with engineering changes.
It’s also the case that many customers, particularly in financial services, government, and regulated sectors, must report cyber incidents to authorities. Establishing that no malicious actor was involved avoids triggering those processes for thousands of Cloudflare customers.
The Wider Impact On Businesses
The outage arrived at a time when the technology sector is already dealing with the operational fallout of several major incidents this year. For example, recent failures at major cloud providers, including AWS and Azure, have contributed to rising concerns about “concentration risk”, i.e., the danger created when many businesses depend on a small number of providers for critical digital infrastructure.
Analysts have estimated that the direct and indirect costs of the Cloudflare outage could reach into the hundreds of millions of dollars once downstream impacts on online retailers, payment providers and services built on Shopify, Etsy and other platforms are included. For small and medium-sized UK businesses, downtime during working hours can lead to missed orders, halted support systems, and reduced customer trust.
For regulators, this incident is likely to be seen as part of a pattern of high-profile disruptions at large providers. Sectors such as financial services already face strict operational resilience requirements, and there is growing speculation that similar expectations may extend to more industries if incidents continue.
How Cloudflare Is Responding
Prince outlined several steps that Cloudflare is now working on to avoid similar scenarios in future. These include:
– Hardening ingestion of internal configuration files so they are subject to the same safety checks as customer-generated inputs (see the sketch after this list).
– Adding stronger global kill switches to stop faulty files before they propagate.
– Improving how the system handles crashes and error reporting.
– Reviewing failure modes across core proxy modules so that a non-essential feature cannot cause critical traffic to fail.
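As a rough idea of what the first two of those measures could look like in practice, here is a minimal sketch with invented names, assuming a validate-then-apply pattern, a last-known-good fallback and a global kill switch. This is not Cloudflare’s actual design, simply one common way of making a configuration pipeline fail safe rather than fail hard.

// Hypothetical sketch of safer configuration ingestion: validate a new file
// before it replaces the current one, keep the last known-good version, and
// honour a global kill switch. Names and limits are invented for illustration.

use std::sync::atomic::{AtomicBool, Ordering};

const MAX_FEATURES: usize = 200;

// Global kill switch: when set, new feature files are ignored entirely.
static FEATURE_UPDATES_DISABLED: AtomicBool = AtomicBool::new(false);

struct FeatureConfig {
    features: Vec<String>,
}

fn validate(contents: &str) -> Result<FeatureConfig, String> {
    let features: Vec<String> = contents
        .lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .map(str::to_string)
        .collect();

    if features.is_empty() {
        return Err("feature file is empty".into());
    }
    if features.len() > MAX_FEATURES {
        return Err(format!("{} features exceeds limit of {}", features.len(), MAX_FEATURES));
    }
    Ok(FeatureConfig { features })
}

/// Apply a new file only if it validates; otherwise keep serving with the
/// last known-good configuration instead of crashing the proxy.
fn apply_update(current: &mut FeatureConfig, new_contents: &str) {
    if FEATURE_UPDATES_DISABLED.load(Ordering::Relaxed) {
        eprintln!("kill switch active: ignoring new feature file");
        return;
    }
    match validate(new_contents) {
        Ok(cfg) => *current = cfg,
        Err(reason) => eprintln!("rejected feature file ({reason}); keeping last known-good config"),
    }
}

fn main() {
    let mut config = validate("feature_a\nfeature_b\n").expect("bootstrap config is valid");

    // An oversized update is rejected rather than panicking the process.
    let oversized = "feature_x\n".repeat(500);
    apply_update(&mut config, &oversized);
    println!("still serving with {} features", config.features.len());
}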
It seems that the wider engineering community has welcomed the transparency, though some external practitioners have questioned why a single configuration file was able to affect so much of the network, and why existing safeguards did not prevent it from propagating globally.
Prince has acknowledged the severity of the incident, describing the outage as “deeply painful” for the team and reiterating that Cloudflare views any interruption to its core traffic delivery as unacceptable.
What Does This Mean For Your Business?
Cloudflare’s account of the incident seems to leave little doubt that this was a preventable internal failure rather than an external threat, and that distinction matters for every organisation that relies on it. The explanation shows how a single flawed process can expose structural weaknesses when so much of the internet depends on centralised infrastructure. For UK businesses, the lesson is that operational resilience cannot be outsourced entirely, even to a provider with Cloudflare’s reach and engineering reputation. The incident reinforces the need for realistic contingency planning, multi-vendor architectures where feasible, and a clear understanding of how a supplier’s internal workings can affect day-to-day operations.
There is also a broader industry point here. Outages at Cloudflare, AWS, Azure and other major players are becoming too significant to dismiss as isolated events. They highlight weaknesses in how complex cloud ecosystems are built and maintained, as well as the limits of automation when oversight relies on assumptions that may not be tested until something breaks at scale. Prince’s emphasis on transparency is helpful, but it also raises questions about how often configuration-driven risks are being overlooked across the industry and how reliably safeguards are enforced inside systems that evolve at speed.
Stakeholders from regulators to hosting providers will surely be watching how quickly Cloudflare implements its promised changes and how effective those measures prove to be. Investors and enterprise customers may also be looking for signs that the underlying engineering and operational processes are becoming more robust, not just patched in response to this incident. Prince’s framing makes clear that this was not a compromise of Cloudflare’s security perimeter, but the reliance on a single configuration mechanism that could bring down so many services is likely to remain a point of scrutiny.
The most immediate implication for customers is probably a renewed focus on the practical realities of dependency. Even organisations that never interact with Cloudflare directly were affected, which shows how embedded its infrastructure is in the modern web. UK businesses, in particular, may need to reassess where their digital supply chains concentrate risk and how disruption at a provider they do not contract with can still reach them. The outage serves as a reminder that resilience is not just about defending against attackers but preparing for internal faults in external systems that sit far beyond a company’s control.